Multi-threaded Microprocessors - Evolution or Revolution
Abstract
Threading in microprocessors is not new; the earliest threaded processor designs were implemented in the late 1970s, and yet only now is threading being used in mainstream microprocessor architecture. This paper reviews threaded microprocessors and explains why the more popular option of out-of-order execution has a poor future and is unlikely to provide a pathway for future microprocessor scalability. The first mainstream threaded architectures are beginning to emerge, but they are unfortunately based on out-of-order execution. This paper will review the relevant trends in multi-threaded microprocessor design and look at one approach in detail, showing how wide instruction issue can be achieved and how it can provide excellent performance, latency tolerance and, above all, scalability with issue width. This model exploits ILP and loop-level parallelism using a vector-like instruction set in a chip multiprocessor.

1 The Forces at Play in ISA Design

There are two forces that determine the form and function of microprocessor architecture today: the first is technology and the second is the market. These forces are quite at odds with each other. On the one hand, technology is all about change. In 1965, Intel co-founder Gordon Moore predicted that the number of transistors on a chip would double every two years. His prediction of exponential growth has not only been achieved but in some cases exceeded. On the other hand, the market is all about inertia, or lack of change. At ACAC 2000, the invited speaker Rumi Zahir, who led the team responsible for the instruction set architecture of IA-64, told an anecdotal story about the briefing his team had been given by Andy Grove. They were given a clean sheet to do whatever they wanted, but with one exception... the resulting microprocessor had to be able to boot a binary of DOS from floppy disc!
In the event, Moore's law solved their problem: the Itanium core processor is not binary compatible with x86 processors; instead it has a separate compatibility unit in hardware to provide IA32 compatibility. There are two routes to ISA development, evolutionary or revolutionary, and it appears that the evolutionary route always relies on technological improvements and results in ever-increasing design complexity. Current out-of-order issue superscalar microprocessors are good examples of this. Intel has demonstrated this approach, requiring each new ISA to be backward compatible with the previous one. On the other hand, revolutionary change has been made; for example, Motorola and IBM moved away from their respective CISC ISAs to the RISC-based PowerPC architecture, first introduced in 1993. Such a major divergence in machine code forced Apple, a major user of the 68000 processor, to emulate the 68000 ISA on the PowerPC for backward compatibility. Emulation has been used by a number of other microprocessor designs, including the Transmeta Crusoe, which was targeted at high-performance but low-power applications. The benefits of speed and power savings made software emulation a practical alternative to hardware compatibility.

Perhaps we should first ask what issues require changes to an ISA design as we follow the inevitable trends of Moore's law. In fact there is just one issue, and that is providing support for concurrency within the ISA. More and more gates mean increased on-chip concurrency, first in word width and now in instruction issue width. The move to a RISC ISA was revolutionary; it did not introduce concurrency explicitly, but rather a simple, regular instruction set that facilitated efficient pipelined instruction execution. In fact, many people forget that the simplicity of RISC was first adopted in order to squeeze a full 32-bit microprocessor onto a single chip for the first time.
RISC has also been introduced as an evolutionary development. For example, Intel's IA32 CISC ISA, which has a very small set of addressable registers, is implemented by a RISC engine with a much larger physical register file; this is achieved by dynamically translating the externally visible CISC ISA into a lower-level RISC ISA. Of course, this is only possible due to the inexorable results of Moore's law. Intel was able to maintain backward compatibility in the IA32 from the 8086 in 1978 through to the Pentium 4, first introduced in 2000, but has now moved to a new ISA, which introduces a regular and explicitly concurrent instruction set.

2 Concurrency in ISAs

Concurrency can be introduced into a computer's operation via the data that one instruction processes or by issuing instructions concurrently. In this paper we do not consider the data parallelism found in SIMD or vector computers, although we do look at a vector model of programming that is supported by wide instruction issue. Neither do we consider the dataflow approach. This leaves just two ways in which concurrency can be introduced explicitly into conventional ISAs: through VLIW or through multi-threading. There is a third way, which is that currently used by most commercial microprocessors: extracting the concurrency from a sequential instruction stream dynamically in hardware. We will look at each of these in turn, beginning with the excesses of the latter in terms of consuming silicon real estate.

2.1 Out-of-order Instruction Execution

Out-of-order instruction execution can be seen as a theoretically optimal solution for exploiting ILP concurrency, because instructions are interleaved in the wide-issue pipelines in close to programmed order, whilst honouring any data and control dependencies, or indeed any storage conflicts introduced by the out-of-order execution itself.
The major benefit is that this is achieved using the existing sequential instruction stream and therefore maintains code-base compatibility. In effect, the instruction stream is dynamically decomposed into micro-threads, which are scheduled and synchronised at no cost in terms of executing additional instructions. Although this may be desirable, speedups using out-of-order execution on superscalar pipelines are not so impressive; it is difficult to obtain a speedup of greater than 2, even on regular code and using 4- or 8-way superscalar issue, e.g. [1]. Moreover, such designs scale rather badly as issue widths are increased. To understand why, let us first look at how a typical superscalar pipeline works. Instructions are prefetched, sometimes along more than one potential execution path. Instructions are then partially decoded and issued to an instruction window, which holds instructions waiting to be executed. Instructions can be issued from this window in any order, provided resource constraints can be met by register renaming. Instructions are then issued to reservation stations, which are buffers associated with each of the execution units. Here a combination of register reads and bypassing, using tagged data, matches each instruction to its data. When all data dependencies have been satisfied, the instructions can be executed. Eventually an instruction is retired, in program order, by writing its data into the ISA-visible registers to ensure a sequential-execution machine state. The first and most significant problem with this approach is that execution must proceed speculatively, and even though there is a high probability of control hazards being correctly predicted [2], this must be put into context.
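The issue discipline described above can be caricatured in a few lines of Python: instructions wait in a window and may begin execution in any order once their source registers are ready, but results only become architecturally visible after their latency elapses. This is a toy model of dynamic scheduling under assumed single-issue and invented latencies, not a model of any real pipeline:

```python
# Toy out-of-order issue: each entry is (dest register, source registers,
# latency in cycles). The program order and latencies are illustrative.
program = [
    ("r1", [], 3),          # i0: long-latency load
    ("r2", ["r1"], 1),      # i1: stalled waiting for i0's result
    ("r3", [], 1),          # i2: independent of i0/i1
    ("r4", ["r3"], 1),      # i3: depends on i2
]

waiting = list(range(len(program)))   # indices still in the window
in_flight = {}                        # index -> completion cycle
done = set()                          # registers produced so far
issue_order = []

cycle = 0
while waiting or in_flight:
    # Writeback: results whose latency has elapsed become visible.
    for idx, t in list(in_flight.items()):
        if t <= cycle:
            done.add(program[idx][0])
            del in_flight[idx]
    # Issue: pick the first waiting instruction whose sources are ready.
    for idx in waiting:
        if all(s in done for s in program[idx][1]):
            issue_order.append(idx)
            in_flight[idx] = cycle + program[idx][2]
            waiting.remove(idx)
            break
    cycle += 1

print(issue_order)   # -> [0, 2, 3, 1]: i1 issues last, out of program order
```

Note that although i1 is second in program order, it issues last because its operand is produced by a long-latency instruction; a real machine would additionally retire results in program order to preserve sequential machine state.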
As a rule of thumb, a basic block is often no longer than 6 instructions [3], and if we assume a 6-way instruction issue superscalar microprocessor with 6 pipeline stages before the branch condition is resolved [4], we are likely to have of the order of 6 branches unresolved at any time. Even with a 95% successful prediction rate for each branch, there is a 1-in-4 chance of failure in any cycle. With unpredictable branching the situation is much worse, and branch prediction failure is almost guaranteed in any cycle (98% chance of failure). These parameters also limit multi-path prefetching, as instruction fetch and decode bandwidth is exponential in the number of unresolved branches; in other words, we could be fetching and decoding up to 64 different instruction paths in a multi-path approach. A second problem is that of sequential-order, or deterministic, machine state, which lags significantly behind instruction fetch due to the many pipeline stages used in out-of-order execution. This means there are significant delays on non-deterministic events, such as an interrupt or an error caused by the misprediction of a branch condition. Recovery from misprediction can therefore have a very high latency. The final problem is one of diminishing returns for the resources used [5], which we will look at in more detail below. Out-of-order execution requires large register files, large instruction issue windows and large caches. As the issue width increases, both the number of register ports and hence the size of the register file must increase. The physical size of the register file grows more than quadratically with instruction issue width [1], largely due to the size of the register cell, which requires both horizontal and vertical busses for each port.
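The branch arithmetic above can be checked directly: with n unresolved branches, each predicted correctly with probability p, the chance that at least one is mispredicted in a given cycle is 1 - p^n. A short calculation using the figures from the text (6 in-flight branches, 95% accuracy, and roughly coin-flip accuracy for unpredictable branches, an assumption consistent with the quoted 98%):

```python
# Probability that at least one of n in-flight branches is mispredicted,
# given a per-branch prediction accuracy p.
def misprediction_chance(p: float, n: int) -> float:
    return 1.0 - p ** n

# 6 unresolved branches at 95% accuracy: roughly a 1-in-4 chance per cycle.
print(round(misprediction_chance(0.95, 6), 3))   # -> 0.265

# Unpredictable branches (~50% accuracy): failure is almost guaranteed.
print(round(misprediction_chance(0.50, 6), 3))   # -> 0.984

# Multi-path prefetch: instruction paths grow as 2^n with unresolved branches.
print(2 ** 6)                                    # -> 64
```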
The proposed Alpha 21464 illustrates this problem very well [6]: its register file comprises 512 64-bit registers and occupies an area over four times that of the 64KB L1 D-cache. The area of the instruction window also grows with issue width. The window can be thought of as sliding over the code stream, within which concurrency can be extracted; it grows with the square of the number of entries due to the scoreboard logic that controls instruction issue. The 21464 has 128 entries. The window must be large so as not to unduly limit the potential ILP that may be exploited by out-of-order issue. The problem is compounded because out-of-order execution introduces additional dependencies (WAR and WAW), which are resolved by register renaming and drive up the size of the register file. These are not real dependencies but simply resource conflicts. Again the proposed 21464 illustrates the problem well: the 128-entry out-of-order issue queue plus renaming logic is approximately ten times the size of the L1 I-cache, also 64KB. Finally, out-of-order issue increases the complexity of the memory hierarchy, both in the levels of cache implemented and in the prefetching and cache-management techniques used. It is well known that caching produces only diminishing returns in performance for the chip area occupied, and current L2 cache arrays will typically occupy between 1/3 and 1/2 of the total chip area [6]. Clearly something is very wrong with the out-of-order approach to concurrency if this extravagant consumption of on-chip resources provides a practical limit on IPC of only about 2. Having said that, further improvements in IPC have been observed in Simultaneous Multi-Threaded (SMT) architectures, which are based on the superscalar approach. However, we have to consider whether this is the most appropriate solution, adding yet further hardware resources to an already non-scalable approach in order to increase instruction issue width still further.
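A rough area model makes the register-file scaling argument concrete: if each port adds one horizontal and one vertical bus to every register cell, cell area grows with the square of the port count, and port count grows with issue width. The register count, ports-per-instruction figure and area unit below are illustrative assumptions, not data from the 21464 or any other design:

```python
# Illustrative area model for a multi-ported register file.
# Assumption: each port adds one horizontal and one vertical bus to
# every cell, so cell area scales with (ports)^2.
def regfile_area(num_regs: int, ports: int, unit: float = 1.0) -> float:
    cell_area = unit * ports * ports   # per-port buses dominate the cell
    return num_regs * cell_area

# Doubling issue width from 4-way to 8-way doubles the port count and so
# quadruples the area, even with the register count held fixed.
a4 = regfile_area(128, ports=3 * 4)    # assume ~3 ports per issued instruction
a8 = regfile_area(128, ports=3 * 8)
print(a8 / a4)                         # -> 4.0
```

In practice the growth is worse than this quadratic floor, since wider issue also forces more physical registers for renaming, which is the "more than quadratically" behaviour cited in the text.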
Note that the 21464 [6] is an out-of-order issue architecture with SMT support.
Publication date: 2003